# PSEO Data Pipeline: SAS Scripts for Sample Selection, Enrollment Merging, and Earnings Integration

This repository contains SAS programs that process student-level administrative data from five higher education systems in four states (Colorado, New York, Ohio, and Texas). The pipeline has three phases: identifying the study population (graduates and non-graduates), merging longitudinal enrollment history, and integrating earnings and demographic data.

## System Abbreviations
- **CDHE:** Colorado Department of Higher Education
- **CUNY:** City University of New York
- **SUNY:** State University of New York
- **ODHE:** Ohio Department of Higher Education
- **THECB:** Texas Higher Education Coordinating Board

---

## Phase 1: Sample Selection (`01.xx.sampleselect_*.sas`)

These scripts identify two groups of interest: graduates in Career and Technical Education (CTE) fields and a control group of non-graduates.

* **Logic:**
    * Defines a `%cte_flag` macro based on specific 2-digit CIP codes (e.g., Engineering, Health, Business).
    * Identifies graduates from 2-year institutions.
    * Identifies non-graduates who completed a minimum threshold of credits (typically 10 or more credit hours).
* **Programs:**
    * `01.01.sampleselect_cdhe.sas`
    * `01.02.sampleselect_cuny.sas`
    * `01.03.sampleselect_suny.sas`
    * `01.04.sampleselect_odhe.sas`
    * `01.05.sampleselect_thecb.sas`
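
The selection logic above can be sketched roughly as follows. This is a minimal illustration, not the code in the `01.xx` programs: the macro parameters, the variable names (`cipcode`, `inst_level`, `graduated`, `credit_hours`), the toy records, and the three CIP families shown (14 = Engineering, 51 = Health Professions, 52 = Business) are all assumptions. Whether the non-graduate group is also restricted by CIP code is not stated here, so the sketch leaves it unrestricted.

```sas
/* Hypothetical stand-in for %cte_flag: sets &out to 1 when the       */
/* 2-digit CIP family is in an illustrative CTE list, else 0.         */
%macro cte_flag(cipvar=cip2, out=cte);
    /* 14 = Engineering, 51 = Health Professions, 52 = Business */
    &out = (&cipvar in ('14', '51', '52'));
%mend cte_flag;

/* Toy student records (made up for illustration). */
data work.students;
    infile datalines dsd;
    length student_id $4 cipcode $7;
    input student_id $ cipcode $ inst_level graduated credit_hours;
    datalines;
A001,51.3801,2,1,64
A002,24.0101,2,1,60
A003,52.0201,2,0,18
A004,14.0101,2,0,6
;
run;

data work.sample;
    set work.students;
    length cip2 $2 group $8;
    cip2 = substr(cipcode, 1, 2);      /* 2-digit CIP family */
    %cte_flag(cipvar=cip2, out=cte)

    /* Graduates: 2-year institution, CTE field (assumed coding). */
    if inst_level = 2 and graduated = 1 and cte = 1 then group = 'grad';
    /* Non-graduates: at least 10 credit hours completed. */
    else if graduated = 0 and credit_hours >= 10 then group = 'nongrad';
    else delete;
run;
```

With these toy records, `A001` is kept as a graduate and `A003` as a non-graduate; `A002` (non-CTE field) and `A004` (fewer than 10 hours) are dropped.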

---

## Phase 2: Enrollment Merging (`02.xx.enrollmerge_*.sas`)

These scripts extract the longitudinal enrollment history for the students identified in Phase 1 to create both "long" and "wide" format datasets.

* **Logic:**
    * Merges the student sample with raw enrollment records (`enr_rsch` and `enr_qpik` tables).
    * Standardizes term codes (Fall, Winter, Spring, Summer) into numerical quarters.
    * Calculates `qtime` (quarterly time index) to track student progress over the years.
    * **Output Wide Files:** Creates a single row per student with array variables (`enr61-enr138`) indicating enrollment status in every quarter of the study period.
* **Programs:**
    * `02.01.enrollmerge_cdhe.sas`
    * `02.02.enrollmerge_cuny.sas`
    * `02.03.enrollmerge_suny.sas`
    * `02.04.enrollmerge_odhe.sas`
    * `02.05.enrollmerge_thecb.sas`
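
As a rough sketch of the term standardization and the long-to-wide reshape, assume each enrollment record carries a calendar year and a term name. The variable names (`cal_year`, `term`), the term-to-quarter mapping, and the `qtime` base are assumptions rather than details taken from the `02.xx` programs. The base used here, `qtime = 1` for 1985Q1, is a common convention in quarterly earnings data; under it, `enr61` would be 2000Q1 and `enr138` would be 2019Q2.

```sas
/* Toy long-format enrollment records (made up for illustration). */
data work.enr_long;
    infile datalines dsd;
    length student_id $4 term $6;
    input student_id $ cal_year term $;
    datalines;
A001,2000,Winter
A001,2000,Fall
A003,2001,Spring
;
run;

/* Map term names to quarters (assumed mapping) and compute qtime. */
data work.enr_long_q;
    set work.enr_long;
    select (upcase(term));
        when ('WINTER') quarter = 1;
        when ('SPRING') quarter = 2;
        when ('SUMMER') quarter = 3;
        when ('FALL')   quarter = 4;
        otherwise       quarter = .;
    end;
    /* Assumed base: qtime = 1 corresponds to 1985Q1. */
    if quarter ne . then qtime = (cal_year - 1985) * 4 + quarter;
run;

proc sort data=work.enr_long_q;
    by student_id qtime;
run;

/* Collapse to one row per student with enr61-enr138 as 0/1 flags. */
data work.enr_wide (keep=student_id enr61-enr138);
    array enr{61:138} enr61-enr138;
    retain enr61-enr138;
    set work.enr_long_q;
    by student_id;
    if first.student_id then do q = 61 to 138;
        enr{q} = 0;                    /* reset flags for a new student */
    end;
    if 61 <= qtime <= 138 then enr{qtime} = 1;
    if last.student_id then output;
run;
```

Declaring the array with bounds `{61:138}` lets `qtime` index it directly, so no offset arithmetic is needed when setting a flag.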

---

## Phase 3: Earnings & Demographic Integration (`03_earnmerge.sas`)

This is the final step of the SAS pipeline, aggregating data across all state systems and merging in external labor market and demographic variables.

* **Logic:**
    * **System Consolidation:** Combines the "wide" enrollment files from all systems (CDHE, THECB, CUNY, SUNY, ODHE) into a single master student file.
    * **Earnings Merge:** Integrates national earnings data and self-employment (SE) records.
    * **Demographics:** Merges with the Individual Characteristics File (ICF) to pull in Date of Birth (DOB), gender, and race/ethnicity.
    * **Variable Construction:** Creates final analysis variables including `logearn_national`, `logearn_state`, and indicators for being "only self-employed."
    * **Final Output:** Exports the processed dataset to Stata format (`allearnings_cte_regressions.dta`) for final econometric analysis.
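
The consolidation, earnings merge, variable construction, and export steps might look roughly like the sketch below. Every dataset and variable name other than `logearn_national`, `logearn_state`, and the output file name is hypothetical, the records are made up, and the definition of "only self-employed" (positive self-employment earnings with no positive national earnings) is an assumption. The ICF merge is omitted; it would presumably follow the same sort-and-merge pattern.

```sas
/* Toy stand-ins for two per-system wide files (names hypothetical). */
data work.wide_cdhe;  length student_id $4; student_id = 'A001'; enr61 = 1; run;
data work.wide_thecb; length student_id $4; student_id = 'T001'; enr61 = 0; run;

/* Stack the system files and record which system each row came from. */
data work.allstudents;
    length system $5;
    set work.wide_cdhe (in=a) work.wide_thecb (in=b);
    if a then system = 'CDHE';
    else if b then system = 'THECB';
run;

/* Toy earnings records, including self-employment (SE) earnings. */
data work.earnings;
    infile datalines dsd;
    length student_id $4;
    input student_id $ earn_national earn_state earn_se;
    datalines;
A001,42000,30000,0
T001,0,0,15000
;
run;

proc sort data=work.allstudents; by student_id; run;
proc sort data=work.earnings;    by student_id; run;

data work.analysis;
    merge work.allstudents (in=s) work.earnings;
    by student_id;
    if s;                              /* keep sampled students only */
    /* Log earnings are left missing when earnings are not positive. */
    if earn_national > 0 then logearn_national = log(earn_national);
    if earn_state > 0 then logearn_state = log(earn_state);
    /* Assumed definition of "only self-employed". */
    only_se = (earn_se > 0 and not (earn_national > 0));
run;

/* Export to Stata; DBMS=DTA typically needs SAS/ACCESS to PC Files. */
proc export data=work.analysis
    outfile="allearnings_cte_regressions.dta"
    dbms=dta replace;
run;
```

Leaving the log variables missing for zero earnings, rather than adding a constant before taking logs, is a choice made for this sketch only; the actual program may handle zeros differently.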

---

## Data Pipeline Flow

1.  **Input:** Raw PSEO administrative files.
2.  **Sample Filtering:** Filter for specific award levels (certificates and associate degrees) and CTE CIP codes.
3.  **Reshaping:** Transition from transaction-level records to a longitudinal wide format.
4.  **Integration:** Merge with ICF (demographics) and UI and national earnings records.
5.  **Export:** Produce the finalized `.dta` file for regression modeling.

## Requirements
- **SAS Environment:** Requires access to secure `INPUTS`, `OUTPUTS`, and `ICF` libraries.
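
A session would typically assign these librefs before running any program. The paths below are placeholders, and the read-only setting is a suggestion rather than something the pipeline is known to require:

```sas
/* Placeholder paths: substitute the secure locations for your project. */
libname INPUTS  "/path/to/inputs" access=readonly;
libname OUTPUTS "/path/to/outputs";
libname ICF     "/path/to/icf"    access=readonly;
```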